Super-bowl-50-twitter-statistics

Statistics gathered with the Twitter Streaming API to examine Twitter users' reactions to Super Bowl 50 events

Data Wrangling

This section describes the techniques we used to transform the data into a format suitable for analysis. After collecting the tweets we had hundreds of separate .json files, so we wrote a Python script to combine them into a single file per category.

Code that merges multiple JSON files into one JSON file

import json
import glob
import os

# create the output directory and the merged output file
if not os.path.exists("json_merged/"):
    os.makedirs("json_merged/")
outfile_name = "json_merged/halftime_show_merged.json"
print("Merging JSON files")

with open(outfile_name, "a") as outfile:

    # the Streaming API saves one tweet per line, so the
    # collected files can be concatenated directly
    path = r"C:\python27\streaming data\*.json"
    for json_file in glob.glob(path):

        # open each json file and copy its contents into the output file
        with open(json_file, "r") as open_json_file:
            outfile.write(open_json_file.read())

Now the data is easier to handle: a few hundred files are down to four, one per category. But we still need to pull out the fields we care about and write them in CSV format. To do this we wrote another short Python script that reads the large JSON file and parses each tweet for the data we are looking for.

Code that parses each tweet (one per line) in the merged JSON file and writes it as a row in a CSV file

import json
import csv

out_path = r"C:\python27\scripts\json_merged\halftime_show.csv"
json_path = r"C:\python27\scripts\json_merged\halftime_show_merged.json"

# the csv module expects binary mode under Python 2
with open(out_path, "ab") as out_file:

    # create csv writer (avoid shadowing the csv module itself)
    writer = csv.writer(out_file)
    # write header to out file, in the same order as the rows below
    writer.writerow([
        "tweet_id", "tweet_time", "tweet_author", "tweet_author_location",
        "tweet_author_id", "tweet_language", "tweet_geo", "tweet_text",
        "tweet_timestamp_ms",
    ])
    # open json file and parse data
    with open(json_path, "r") as open_json_file:

        # get each tweet (one JSON object per line)
        for line in open_json_file:
            try:
                tweet = json.loads(line)
                # row holds the attributes we pull from each tweet
                row = (
                    tweet['id'],                    # tweet_id
                    tweet['created_at'],            # tweet_time
                    tweet['user']['screen_name'],   # tweet_author
                    tweet['user']['location'],      # tweet_author_location
                    tweet['user']['id_str'],        # tweet_author_id
                    tweet['lang'],                  # tweet_language
                    tweet['geo'],                   # tweet_geo
                    tweet['text'],                  # tweet_text
                    tweet['timestamp_ms']           # tweet_timestamp_ms
                )
                # encode unicode fields for Python 2's csv writer
                values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
                writer.writerow(values)
            except (ValueError, KeyError):
                # skip blank lines, malformed JSON, and delete notices
                # that lack the fields above
                pass

After running this script the tweets are in a clean CSV file that can be imported into RStudio for analysis.